Abstract

We propose a similarity-based method, which uses the similarity between nodes,
to address the problem of classification in partially labeled networks. The
basic assumption is that two nodes are more likely to be categorized into the
same class if they are more similar. In this paper, we introduce ten similarity
indices, including five local ones and five global ones. Empirical results on
the co-purchase network of political books show that the similarity-based
method can give highly accurate classification even when the labeled nodes are
sparse, which is one of the difficulties in classification. Furthermore, we
find that when the target network has many labeled nodes, the local indices can
perform as well as the global indices do, while when the data is sparse the
global indices perform better. Besides, the similarity-based method can to some
extent overcome the inconsistency problem, which is another difficulty in
classification.

Key words: complex networks, similarity, classification, labeled network 
PACS: 89.20.Ff, 89.75.Hc, 89.65.-s 



1 Introduction 

Recently, the problem of within-network classification in partially labeled
networks has attracted much attention. Given a network in which only some of
the nodes are labeled, the problem is to predict the labels of the unlabeled
nodes based on the known labels and the network structure. Many algorithms have
been proposed. These methods can be widely applied to many fields, such as

* Email address: linyuan.lue@unifr.ch (Linyuan Lü)

Preprint submitted to Physica A, 3 March 2010



hypertext categorization [1,2], distinguishing fraudulent from legitimate users
in cell phone networks [3], detecting whether an email is for a certain task
[4] and predicting disease-related genes [5].

Generally speaking, the known methods can be classified into two groups. One is
collective classification, which refers to the combined classification using
three types of correlations: (i) between a node's label and its attributes,
(ii) between a node's label and its neighbors' attributes, and (iii) between a
node's label and its neighbors' labels (see a brief introduction in Ref. [6]).
One remarkable advantage of this method is its high ability to learn the
dependency structure, such as positive or negative correlation (i.e.
consistency or inconsistency). However, when the labeled nodes are sparse, this
method has difficulty giving accurate classification. The sparsity problem can
be addressed by another group of methods, named semi-supervised learning, which
make use of both labeled and unlabeled data for training (see Ref. [7] for more
information). The latent assumption of these methods is consistency with the
label information, namely that nearby nodes tend to have the same label.
Therefore, when this assumption does not hold, the performance of these methods
is largely degraded.

Gallagher et al. proposed a method that adds ghost edges between every pair of
labeled and unlabeled nodes in the target network, which enable the flow of
information from the labeled nodes to the unlabeled nodes [3]. They assigned a
weight to each ghost edge based on the score of its two endpoints obtained by
the Even-step random walk with restart (Even-step RWR) algorithm. The
experimental results on real-world data showed that their method can to some
extent solve the sparsity problem and the negative correlation problem (i.e.
inconsistency), and performs well where existing approaches, such as collective
classification and semi-supervised learning, fail.

In this paper, we compare the performance of the Even-step RWR index with nine
other similarity indices which have been widely used in the link prediction
problem [8,9,10]. These include five local indices, namely Common Neighbors
[11], the Jaccard coefficient [12], the Sørensen index [13], the Adamic-Adar
index [14] and the Resource Allocation index [9], and four global indices,
namely the Katz index [15], Average Commute Time [16], the cosine based on the
pseudoinverse of the Laplacian matrix (cos+) and Random Walk with Restart (RWR)
[17]. In addition, we also consider a simple relational neighbors algorithm,
which assumes that an unlabeled node tends to have the same label as its
neighbors [18]. Empirical results on the co-purchase network of political books
show that the similarity-based methods perform better than the relational
neighbors algorithm. The improvement is especially prominent when the labeled
nodes are sparse. Furthermore, when the data is dense, the local indices
perform as well as the global indices, while when the data is sparse the global
indices perform better.

The rest of this paper is organized as follows. In Section 2 we introduce ten
similarity indices, including five indices based on local information and five
based on global information. Section 3 describes the metric used to evaluate
the algorithms' accuracy. Section 4 shows the experimental results of the ten
indices on the co-purchase network of political books. Finally, we conclude
this paper in Section 5.



2 Similarity indices 



We consider five local similarity indices as well as five global ones. All are
defined based on the network structure. Each index is briefly introduced below.

(1) Common Neighbors - For a node x, let $\Gamma(x)$ denote the set of
neighbors of x. By common sense, two nodes, x and y, are more similar if they
have many common neighbors. The simplest measure of this neighborhood overlap
is the direct count, namely

$$ s^{CN}_{xy} = |\Gamma(x) \cap \Gamma(y)|, \tag{1} $$

where $|Q|$ is the cardinality of the set Q. It is obvious that
$s_{xy} = (A^2)_{xy}$, where A is the adjacency matrix, in which $A_{xy} = 1$
if x and y are directly connected and $A_{xy} = 0$ otherwise. Note that
$(A^2)_{xy}$ is also the number of different paths of length 2 connecting x
and y.

(2) Jaccard Index [12] - This index was proposed by Jaccard over a hundred
years ago, and is defined as

$$ s^{Jaccard}_{xy} = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}. \tag{2} $$



(3) Sørensen Index [13] - This index is used mainly for ecological community
data, and is defined as

$$ s^{\mathrm{S{\o}rensen}}_{xy} = \frac{2\,|\Gamma(x) \cap \Gamma(y)|}{k(x) + k(y)}, \tag{3} $$

where k(x) denotes the degree of node x.



(4) Adamic-Adar Index [14] - This index refines the simple counting of common
neighbors by assigning the less-connected neighbors more weight, and is
defined as

$$ s^{AA}_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log k(z)}. \tag{4} $$

(5) Resource Allocation [9] - Consider a pair of nodes, x and y, which are not
directly connected. The node x can send some resource to y, with their common
neighbors playing the role of transmitters. In the simplest case, we assume
that each transmitter has a unit of resource, and will equally distribute it
among all its neighbors. The similarity between x and y can be defined as the
amount of resource y receives from x, which is

$$ s^{RA}_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{k(z)}. \tag{5} $$

Clearly, this measure is symmetric, namely $s_{xy} = s_{yx}$. Note that,
although resulting from different motivations, the AA index and the RA index
have a very similar form. Indeed, they both suppress the contribution of the
high-degree common neighbors, in different ways: the AA index weights each
common neighbor by $1/\log k(z)$ while the RA index uses the linear form
$1/k(z)$. The difference is insignificant when the degree k is small, while it
is great when k is large. Therefore, the RA index punishes the high-degree
common neighbors more heavily.
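
The five local indices above can be illustrated with a short Python sketch. It is only a minimal sketch under stated assumptions, not the implementation used for the experiments: the function name `local_similarities`, the dict-of-sets network representation and the toy network are all hypothetical.

```python
import math


def local_similarities(adj, x, y):
    """Five local indices, Eqs. (1)-(5), for two distinct nodes x and y.

    adj is an assumed representation: a dict mapping each node to the set
    of its neighbors in an undirected, unweighted network.
    """
    common = adj[x] & adj[y]
    union = adj[x] | adj[y]
    k_sum = len(adj[x]) + len(adj[y])
    return {
        "CN": len(common),
        "Jaccard": len(common) / len(union) if union else 0.0,
        "Sorensen": 2 * len(common) / k_sum if k_sum else 0.0,
        # Every common neighbor z of two distinct nodes has k(z) >= 2,
        # so log k(z) > 0 and the AA term is well defined.
        "AA": sum(1.0 / math.log(len(adj[z])) for z in common),
        "RA": sum(1.0 / len(adj[z]) for z in common),
    }


# Hypothetical toy network with edges 1-3, 1-4, 2-3, 2-4, 4-5.
toy_adj = {1: {3, 4}, 2: {3, 4}, 3: {1, 2}, 4: {1, 2, 5}, 5: {4}}
scores = local_similarities(toy_adj, 1, 2)
```

In this toy network nodes 1 and 2 share the neighbors 3 (degree 2) and 4 (degree 3), so the RA score is 1/2 + 1/3 while the AA score is 1/log 2 + 1/log 3; the higher-degree neighbor is penalized more strongly by RA.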

(6) Katz Index [15] - This measure is based on the ensemble of all paths: it
directly sums over the collection of paths, exponentially damped by length to
give the short paths more weight. The mathematical expression reads

$$ s^{Katz}_{xy} = \sum_{l=1}^{\infty} \beta^l \, |paths^{\langle l \rangle}_{xy}| = \beta A_{xy} + \beta^2 (A^2)_{xy} + \beta^3 (A^3)_{xy} + \cdots, \tag{6} $$

where $paths^{\langle l \rangle}_{xy}$ is the set of all paths of length l
connecting x and y, and $\beta$ is a free parameter controlling the weights of
the paths. Obviously, a very small $\beta$ yields a measure close to CN,
because the long paths contribute very little. The similarity matrix can be
written as $S = (I - \beta A)^{-1} - I$. Note that $\beta$ must be lower than
the reciprocal of the largest eigenvalue of matrix A to ensure convergence.
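
The closed form of the Katz index and its convergence condition can be sketched with NumPy as follows. This is an illustrative sketch only; the function name `katz_similarity`, the toy path network and the value beta = 0.1 are assumptions made for the example.

```python
import numpy as np


def katz_similarity(A, beta):
    """Katz similarity matrix S = (I - beta*A)^(-1) - I for a symmetric A."""
    A = np.asarray(A, dtype=float)
    # eigvalsh is appropriate because A is symmetric (undirected network).
    lambda_max = np.max(np.abs(np.linalg.eigvalsh(A)))
    if beta * lambda_max >= 1.0:
        raise ValueError("beta must be below 1 / (largest eigenvalue of A)")
    identity = np.eye(A.shape[0])
    return np.linalg.inv(identity - beta * A) - identity


# Hypothetical 3-node path 0-1-2; beta = 0.1 is an arbitrary illustrative value.
A_path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
S_katz = katz_similarity(A_path, beta=0.1)
```

The explicit check on beta mirrors the convergence condition stated above: without it the matrix inverse may still exist, but it no longer equals the damped sum over paths.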

(7) Average Commute Time [16] - Denoting by m(x,y) the average number of steps
required by a random walker starting from node x to reach node y, the average
commute time between x and y is n(x,y) = m(x,y) + m(y,x), which can be
computed in terms of the pseudoinverse of the Laplacian matrix, $L^+$, as

$$ n(x,y) = E\,(l^+_{xx} + l^+_{yy} - 2\,l^+_{xy}), \tag{7} $$

where $l^+_{xy}$ denotes the corresponding entry of $L^+$ and E is the total
number of edges. Assuming that two nodes are more similar if they have a
smaller average commute time, the similarity between the nodes x and y can be
defined as the reciprocal of n(x,y), namely

$$ s^{ACT}_{xy} = \frac{1}{n(x,y)} = \frac{1}{E\,(l^+_{xx} + l^+_{yy} - 2\,l^+_{xy})}. \tag{8} $$

(8) Cosine based on $L^+$ [16] - This index is an inner-product based measure,
which is defined as the cosine of node vectors, namely

$$ s^{\cos^+}_{xy} = \cos(x,y)^+ = \frac{l^+_{xy}}{\sqrt{l^+_{xx} \cdot l^+_{yy}}}. \tag{9} $$
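
Both indices based on $L^+$ can be obtained from a single pseudoinverse, as in the following sketch. It assumes a connected, undirected network given as a dense adjacency matrix and follows Eqs. (7)-(9) as written here; the function name `act_and_cos_plus` and the toy network are hypothetical, and setting the diagonal of the ACT matrix to zero is a convention of this sketch, since self-similarity is never used.

```python
import numpy as np


def act_and_cos_plus(A):
    """Return (ACT, cos+) similarity matrices built from the Laplacian pseudoinverse."""
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=1)
    n_edges = degrees.sum() / 2.0
    L_plus = np.linalg.pinv(np.diag(degrees) - A)  # pseudoinverse of L = D - A
    d = np.diag(L_plus)
    # spread[x, y] = l+_xx + l+_yy - 2 l+_xy, the bracket in Eq. (7).
    spread = d[:, None] + d[None, :] - 2.0 * L_plus
    with np.errstate(divide="ignore", invalid="ignore"):
        act = 1.0 / (n_edges * spread)
    np.fill_diagonal(act, 0.0)  # assumed convention: no self-similarity
    cos_plus = L_plus / np.sqrt(np.outer(d, d))
    return act, cos_plus


# Hypothetical 3-node path 0-1-2.
A_path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
act, cos_plus = act_and_cos_plus(A_path)
```

Note that cos+ can be negative, whereas ACT is always positive for distinct nodes of a connected network.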



(9) Random Walk with Restart [17] - This index is a direct application of the
PageRank algorithm. Consider a random walker starting from node x, who
iteratively moves to a random neighbor with probability c and returns to node
x with probability 1 - c. Denote by $q_{xy}$ the probability that this random
walker is located at node y in the steady state; then we have

$$ \vec{q}_x = c P^T \vec{q}_x + (1 - c)\,\vec{e}_x, \tag{10} $$

where $\vec{e}_x$ is an N x 1 vector with the x-th element equal to 1 and all
others equal to 0, and $P^T = A D^{-1}$, where $D_{ij} = \delta_{ij} k_i$. The
solution is straightforward:

$$ \vec{q}_x = (1 - c)(I - c P^T)^{-1} \vec{e}_x. \tag{11} $$

We then define the similarity between node x and node y as
$s_{xy} = q_{xy} + q_{yx}$.

(10) Even-step RWR [3] - To avoid the immediate neighbors, we can consider
only the even-length paths. Mathematically, we replace the transition matrix
with $M = (P^T)^2$.
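
RWR and its even-step variant differ only in the transition matrix, which the sketch below makes explicit. It assumes a dense symmetric adjacency matrix without isolated nodes; the function name `rwr_similarity` and the toy network are hypothetical, and c = 0.1 is the value quoted in the caption of Fig. 3.

```python
import numpy as np


def rwr_similarity(A, c=0.1, even_step=False):
    """Symmetrized RWR scores s_xy = q_xy + q_yx following Eqs. (10)-(11)."""
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=0)
    PT = A / degrees  # divides column j by k_j, i.e. P^T = A D^{-1}
    if even_step:
        PT = PT @ PT  # Even-step RWR: M = (P^T)^2
    n = A.shape[0]
    # Column x of Q is the steady-state vector q_x of Eq. (11).
    Q = (1.0 - c) * np.linalg.inv(np.eye(n) - c * PT)
    return Q + Q.T


# Hypothetical 3-node path 0-1-2.
A_path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
S_rwr = rwr_similarity(A_path)
S_even = rwr_similarity(A_path, even_step=True)
```

On this toy path the even-step score between the adjacent nodes 0 and 1 vanishes, because every walk between them has odd length, while nodes 0 and 2 (two steps apart) keep a positive score.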

For comparison, we also consider the simplest method, namely Relational
Neighbors (RN) [18]. Given an unlabeled node u, the probability that its label
is $l_i$ equals

$$ p(l_i|u) = \frac{|V'|}{|V''|} = \frac{|\{v' \in \Gamma(u) \mid label(v') = l_i\}|}{|\{v'' \in \Gamma(u) \mid label(v'') \neq \emptyset\}|}, \tag{12} $$

where V' is the set constituted by u's neighbors whose label is $l_i$, and V''
is the set of u's neighbors that are labeled.
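
Eq. (12) amounts to counting labels among the labeled neighbors, as in the following sketch. The function name `relational_neighbors`, the data layout and the toy example are assumptions; returning an empty dict when u has no labeled neighbor is a choice of this sketch, since the text does not specify that case.

```python
from collections import Counter


def relational_neighbors(adj, labels, u):
    """Return {label: probability} for node u according to Eq. (12).

    adj maps each node to the set of its neighbors; labels maps every
    labeled node to its label (unlabeled nodes are simply absent).
    """
    counts = Counter(labels[v] for v in adj[u] if v in labels)
    total = sum(counts.values())
    if total == 0:
        return {}  # assumed behavior: no labeled neighbor, no prediction
    return {label: count / total for label, count in counts.items()}


# Hypothetical example: node 5 has four neighbors, three of them labeled.
toy_adj = {5: {1, 2, 3, 4}}
toy_labels = {1: "a", 2: "a", 3: "b"}
rn_probs = relational_neighbors(toy_adj, toy_labels, 5)
```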

Fig. 1. (Color online) An illustration of how to predict the node's label
according to the similarity.

3 Method 



Consider an unweighted undirected network of both labeled and unlabeled nodes:
G(V, E, L), where V is the set of nodes, E is the set of links and
$L = \{l_1, l_2, \cdots, l_m\}$ is the set of labels. For each pair of nodes,
x and y, every algorithm referred to in this paper assigns a score $s_{xy}$.
For an unlabeled node u, the probability that it belongs to $l_i$ is

$$ p(l_i|u) = \frac{\sum_{\{v \mid label(v) = l_i\}} s_{uv}}{\sum_{\{v \mid label(v) \neq \emptyset\}} s_{uv}}, \tag{13} $$

where $l_i \in L$. The predicted label of node u is the one with the largest
$p(l_i|u)$. If there is more than one maximum value, we randomly select one. A
simple example is shown in Fig. 1, where there are two kinds of labels (i.e. a
and b) and five nodes, four of which are already labeled. Our task is to
predict the label of node 5. According to the common neighbors algorithm, we
obtain the similarity between node 5 and the other four labeled nodes, and then
we infer that the probability that node 5 is labeled by a equals 3/4.
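
The labeling rule of Eq. (13) can be sketched as below. The function name `label_probabilities` and the similarity values are hypothetical; the values are not read off Fig. 1 and are chosen only so that the result reproduces the probability 3/4 quoted above. The uniform fallback for the case where all similarities vanish is an assumption of this sketch.

```python
def label_probabilities(sim_u, labels):
    """Return {label: p(label | u)} according to Eq. (13).

    sim_u maps nodes to their similarity with the unlabeled node u;
    labels maps every labeled node to its label.
    """
    totals = {}
    for v, label in labels.items():
        totals[label] = totals.get(label, 0.0) + sim_u.get(v, 0.0)
    norm = sum(totals.values())
    if norm == 0:
        # Assumed fallback: u is not similar to any labeled node, so all
        # labels tie and one would be picked at random.
        return {label: 1.0 / len(totals) for label in totals}
    return {label: total / norm for label, total in totals.items()}


# Hypothetical similarities between node 5 and the labeled nodes 1-4.
toy_labels = {1: "a", 2: "a", 3: "b", 4: "b"}
toy_sim = {1: 1, 2: 2, 3: 1, 4: 0}
probs = label_probabilities(toy_sim, toy_labels)
```

Here the similarities to the nodes labeled a sum to 3 and those to the nodes labeled b sum to 1, which gives p(a|5) = 3/4.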

To test the algorithm's accuracy, all the labeled nodes are randomly divided
into two parts: the training set, $V^T$, is treated as known information,
while the probe set, $V^P$, is used for testing. We denote by q the proportion
of labeled nodes assigned to the training set, which is considered as the
density index. A smaller q indicates a sparser labeled network. The accuracy is
quantified by the probability that we predict correctly. For a testing node
$u \in V^P$ whose label is $l_i$, if $p(l_i|u) > p(l_j|u)$ for all $j \neq i$,
we predict correctly, and thus $q_u = 1$. If there are n maximum values
corresponding to n different labels and the right label is one of them, we
have $q_u = 1/n$. Otherwise $q_u = 0$. Summing over all the testing nodes, the
accuracy equals

$$ \mathrm{Accuracy} = \frac{1}{|V^P|} \sum_{u \in V^P} q_u, \tag{14} $$

where $|V^P|$ is the number of nodes in the probe set. For example, if there
are two categories in the target network, namely $l_1$ and $l_2$, the accuracy
is given by Eq. (15), where n' is the number of nodes in the probe set that are
predicted correctly and n'' is the number of nodes $u \in V^P$ having the same
probability for the two labels (i.e. $p(l_1|u) = p(l_2|u)$).
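
The evaluation procedure can be sketched as follows. This is a minimal sketch with hypothetical helper names (`split_labeled`, `tie_aware_score`, `accuracy`); ties are detected by exact equality of the probabilities, which is an assumption that may need a tolerance for floating-point scores.

```python
import random


def split_labeled(labels, q, rng=random):
    """Randomly put a fraction q of the labeled nodes into the training set."""
    nodes = sorted(labels)
    rng.shuffle(nodes)
    n_train = int(round(q * len(nodes)))
    train = {v: labels[v] for v in nodes[:n_train]}
    probe = {v: labels[v] for v in nodes[n_train:]}
    return train, probe


def tie_aware_score(probs, true_label):
    """q_u of Eq. (14): 1/n if the true label is among the n top labels, else 0."""
    if not probs:
        return 0.0
    best = max(probs.values())
    top = [label for label, p in probs.items() if p == best]
    return 1.0 / len(top) if true_label in top else 0.0


def accuracy(predictions, truth):
    """Average q_u over the probe set; predictions maps node -> {label: prob}."""
    return sum(tie_aware_score(predictions[u], truth[u]) for u in truth) / len(truth)


# Hypothetical probe set: one correct prediction, one tie, one wrong prediction.
toy_truth = {"u1": "a", "u2": "b", "u3": "b"}
toy_predictions = {
    "u1": {"a": 0.75, "b": 0.25},
    "u2": {"a": 0.5, "b": 0.5},
    "u3": {"a": 0.9, "b": 0.1},
}
toy_accuracy = accuracy(toy_predictions, toy_truth)
```

With one correct prediction and one tie among three probe nodes, the accuracy is (1 + 0.5)/3 = 0.5.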



4 Empirical results 

We compare the above-mentioned ten similarity indices on the co-purchase
network of political books [19]. This network contains 105 nodes (books) and
441 edges. All books are classified into three categories: neutral, liberal
and conservative. For simplicity, we start the experiments with sampled
networks containing only two classes. We therefore sample three labeled
networks, corresponding to the following three tasks:

Task 1: Is an unlabeled node neutral? For this task, we label the books which
are neutral by a and the others by b (i.e. not neutral).

Task 2: Is an unlabeled node liberal? For this task, we label the books which
are liberal by a and the others by b (i.e. not liberal).

Task 3: Is an unlabeled node conservative? For this task, we label the books
which are conservative by a and the others by b (i.e. not conservative).

Table 1 summarizes the basic statistics of these three sampled networks,
corresponding to task 1, task 2 and task 3 respectively. N(x) (x = a, b) is
the number of nodes labeled by x. E(x) indicates the number of edges connecting
to the nodes labeled by x. Denote by M(x) the number of edges whose two
endpoints have the same label x; then C(x) = M(x)/E(x) indicates the local
consistency of the subgraph constituted by the nodes labeled by x and the
edges connecting to these nodes. C is the local consistency of the whole
network, which reads C = (M(a) + M(b))/E, where E is the total number of edges
of the whole network (here E = 441). Note that E < E(a) + E(b), since every
edge joining a node labeled a and a node labeled b is counted in both E(a) and
E(b). Here, we further



$$ \mathrm{Accuracy} = \frac{n' + 0.5\,n''}{|V^P|} \tag{15} $$




Graph 1: C = 0, C(a) = C(b) = 0, C2 = 1
Graph 2: C = 2/6 = 1/3, C(a) = C(b) = 1/5, C2 = 1/3
Graph 3: C = 3/6 = 1/2, C(a) = 1/2, C(b) = 0, C2 = 1/2
Graph 4: C = 1, C(a) = 1, C(b) = 0, C2 = 1

Fig. 2. (Color online) Illustration of the calculation of local consistency
and two-step consistency.

Table 1

The summary of local consistency of each label and each sampled network. N(a)
and N(b) are the numbers of nodes labeled by a and b respectively. E(a) and
E(b) indicate the numbers of edges connecting to the nodes labeled by a and b
respectively. C(a) and C(b) are the local consistency of the nodes labeled by
a and b respectively. C and C2 are the local consistency and two-step
consistency of the sampled network, respectively.

        N(a)   N(b)   E(a)   E(b)   M(a)   M(b)   C(a)    C(b)    C       C2
Net1    13     92     67     432    9      374    0.134   0.866   0.869   0.864
Net2    43     62     208    269    172    233    0.827   0.866   0.918   0.894
Net3    49     56     236    251    190    205    0.805   0.817   0.890   0.882


develop the definition of local consistency into two-step consistency, denoted
by C2, which equals the number of paths of length 2 whose two endpoints have
the same label divided by the total number of paths of length 2. Clearly, the
common neighbor index will perform well in a network with high C2. Four simple
examples of calculating C(x), C and C2 are shown in Fig. 2. One can see that in
the first graph, because C = 0, RN will perform very badly, while CN performs
very well (C2 = 1). However, in the fourth graph both RN and CN can give good
performance.
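
The consistency measures defined above can be computed as in the following sketch. It assumes a fully labeled network given as an edge list, and it counts a path of length 2 as an unordered pair of distinct neighbors of a common middle node; the function names and the toy star network are hypothetical and are not taken from Fig. 2.

```python
from itertools import combinations


def label_consistency(edges, labels, x):
    """C(x) = M(x) / E(x) for a label x."""
    touching = [(u, v) for u, v in edges if labels[u] == x or labels[v] == x]
    same = sum(1 for u, v in touching if labels[u] == labels[v])
    return same / len(touching)


def local_consistency(edges, labels):
    """C: fraction of edges whose two endpoints carry the same label."""
    return sum(1 for u, v in edges if labels[u] == labels[v]) / len(edges)


def two_step_consistency(edges, labels):
    """C2: fraction of length-2 paths whose two endpoints carry the same label."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    same = total = 0
    for neighbors in adj.values():
        for u, v in combinations(neighbors, 2):
            total += 1
            same += labels[u] == labels[v]
    return same / total


# Hypothetical star: a center labeled a connected to three leaves labeled b.
star_edges = [(0, 1), (0, 2), (0, 3)]
star_labels = {0: "a", 1: "b", 2: "b", 3: "b"}
C = local_consistency(star_edges, star_labels)
C2 = two_step_consistency(star_edges, star_labels)
```

The star reproduces the situation described for the first graph: no edge joins equal labels (C = 0), yet every path of length 2 connects two leaves with the same label (C2 = 1), which is the regime where common-neighbor-based indices help and RN does not.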

The comparison of the ten similarity indices on the three sampled networks is
shown in Fig. 3. Panels (a), (c) and (e) show the results of the local indices,
while (b), (d) and (f) report the results of the global indices. It is
interesting that all five local indices give almost the same results,
especially when the density of labeled nodes is small. This is because all five
indices are common-neighbor based, and when q is small, whether an unlabeled
node is related to a labeled node at all plays a more important role than the
exact correlation (similarity score) between them. Furthermore, because of the
high C2 of these three networks, all the common-neighbor-based indices perform
well, and even when the data is sparse they can give much better prediction than
Fig. 3. (Color online) Comparison of ten similarity indices on three sampled
networks containing two categories. (a) and (b) are the results of the local
and global indices for task 1 respectively, (c) and (d) are the results of the
local and global indices for task 2 respectively, and (e) and (f) are the
results of the local and global indices for task 3 respectively. For the RWR
index we set c = 0.1. Each number is obtained by averaging over 1000
implementations with independent random divisions into training set and probe
set.

Fig. 4. (Color online) Comparison of ten similarity indices on the network
taking into account three categories. For RWR we set c = 0.1. Each number is
obtained by averaging over 1000 implementations with independent random
divisions into training set and probe set.

RN. Compared with the global indices, the local indices can give competitively
accurate classification when q is large, but when the labeled data is sparse,
for most unlabeled nodes it is too difficult to find a labeled node nearby,
and thus the global indices perform better. Among the five global indices, the
performance of the Katz index, RWR and Even-step RWR is stable, while the
performance of ACT and cos+ is not. For example, in sampled network 1, the ACT
index performs very well but cos+ is even worse than pure chance. However, in
sampled network 3, the cos+ index performs the best but the ACT index performs
even worse than the simplest method, RN.

Obviously, it is more difficult to obtain highly accurate classification when
we consider many categories together. We further carry out an experiment on
the network containing all three categories. Our task is to detect the
category of an unlabeled book, namely whether it is neutral, liberal or
conservative. We label the books by n (i.e. neutral), l (i.e. liberal) and c
(i.e. conservative) according to their categories. The local consistency and
two-step consistency of this network are 0.8413 and 0.8204 respectively, which
are both lower than those of the three sampled networks containing only two
classes, and thus the accuracy is also lower, as shown in Fig. 4. One can see
that the results are similar to those on sampled network 3, where the biggest
class, conservative, is considered. This result demonstrates that the majority
class plays the main role.



5 Conclusion and Discussion 



In this paper, we investigated similarity-based classification for partially
labeled networks. The basic assumption is that two nodes are more likely to
have the same label if they are more similar to each other. We introduced ten
similarity indices which have been widely used to solve the link prediction
problem of complex networks, including five common-neighbor-based indices,
namely Common Neighbors, the Jaccard coefficient, the Sørensen index, the
Adamic-Adar index and the Resource Allocation index, and five global indices,
namely the Katz index, Average Commute Time, the cosine based on the
pseudoinverse of the Laplacian matrix (cos+), Random Walk with Restart (RWR)
and Even-step RWR. We carried out experiments on the co-purchase network of
political books. The results showed that similarity-based classification
performs much better than the relational neighbors algorithm, especially when
the labeled nodes are sparse. Furthermore, we found that when the data is dense
the local indices can perform as well as the global indices. However, when the
data is sparse, for an unlabeled node it is too difficult to find a labeled
node nearby, and thus the global indices perform better. Compared with
previously proposed algorithms, the group of similarity-based classification
methods has three advantages: firstly, it can to some extent solve the sparse
data problem by using the global indices; secondly, when the network
consistency assumption does not hold it can still give highly accurate
classification; thirdly, without any learning process this method has lower
computational complexity than other, more complicated methods.

However, there are still some open problems left. For example, what is the
relation between the network label structure and the performance of each
similarity index? In-depth analysis on modeled networks may be helpful, where
we can control the label density, the network consistency and also the
proportion of each class. We hope this work can provide a novel view for the
study of classification in partially labeled networks, and we believe that
there is still much room for further contributions. For example, when the
number of nodes in one class is much larger than in the others, the unlabeled
nodes are more likely to be assigned the same label as the majority. To solve
this problem, we can consider only the top-k most similar labeled nodes when
calculating the probability. In addition, we can also use negative correlation
in the adjacency matrix A directly, namely, for a nonzero element of A, if the
nodes x and y have different labels, we set $A_{xy} = -1$. In this way, we can
not only obtain the strength of the correlation between the unlabeled node and
the labeled one but also know the correlation type, positive or negative.
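
The top-k idea mentioned above could be sketched as follows. This is only one possible reading of the suggestion, not a method evaluated in this paper; the function name `top_k_label_probabilities`, the tie-breaking by input order and the toy values are assumptions.

```python
def top_k_label_probabilities(sim_u, labels, k):
    """Eq. (13) restricted to the k labeled nodes most similar to u."""
    ranked = sorted(labels, key=lambda v: sim_u.get(v, 0.0), reverse=True)[:k]
    totals = {}
    for v in ranked:
        totals[labels[v]] = totals.get(labels[v], 0.0) + sim_u.get(v, 0.0)
    norm = sum(totals.values())
    if norm == 0:
        return {}  # assumed behavior: no similar labeled node among the top k
    return {label: total / norm for label, total in totals.items()}


# Hypothetical case: one very similar node labeled a, four weakly similar
# nodes from the majority class b.
toy_labels = {1: "a", 2: "b", 3: "b", 4: "b", 5: "b"}
toy_sim = {1: 0.9, 2: 0.3, 3: 0.3, 4: 0.3, 5: 0.3}
all_nodes = top_k_label_probabilities(toy_sim, toy_labels, k=5)
top_one = top_k_label_probabilities(toy_sim, toy_labels, k=1)
```

Using all five labeled nodes, the majority class b outweighs the single strongly similar node labeled a; with k = 1 only that node is used and the prediction switches to a.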